ABSTRACT

Aim: The aim of this study was to identify methods that could be a major step towards the diagnosis of untreated CD,
and to find metabolic biomarkers of CD by comparison with a control group.

Background: Celiac disease (CD) is a common autoimmune disorder that is not easily diagnosed using clinical tests.

Patients and methods: Thirty cases and 30 controls were enrolled in this study. Metabolic profiles were obtained
using proton nuclear magnetic resonance spectroscopy (¹H NMR) to find metabolites that are helpful for the detection of
CD. CD and healthy subjects were classified using random forest (RF).

Results: The classification model correctly classified 89% of the CD and healthy subjects in the
external test set. The metabolites that differed in people with CD were identified using RF; these metabolites
include lactate, valine and lipid.

Conclusion: The findings of the present study reveal that serum lactate, valine and lipid levels are lower in CD patients than in
healthy cohorts. These metabolites may provide diagnostic tools as well as insight into potential targets for disease therapy.
Introduction

Celiac disease (CD) is a common systemic
disorder that can have multiple clinical
manifestations. It has a multifactorial etiology with
complex genetics and histology. A comparison of
recent studies in European and Middle Eastern
countries has shown that CD is common in both
areas, with a similar prevalence (1). Despite
advances in investigation techniques, CD remains a
challenging problem that often eludes diagnosis and
receives sub-optimal attention (2). In this regard,
metabonomics can provide powerful techniques for
CD diagnosis.

Metabonomics is described as the quantitative
measurement of the multi-parametric metabolic
response of living systems to pathophysiological
stimuli or genetic modification(s) (3, 4). This
quantitative measurement can provide multivariate
metrics of potential metabolic dysfunction in any
living system. Several analytical spectroscopic
methods are available for interpreting metabolic
profiles in biological samples such as urine,
plasma or tissue. In biological systems, proton
nuclear magnetic resonance spectroscopy (¹H NMR)
is a useful method that provides valuable data on the
metabolites (5, 6). Several metabonomics studies
have been carried out on CD. For instance, Bertini et al.
defined the metabolic signature of CD through NMR
of urine and serum samples of CD patients. In that
study, NMR metabolic profiling of serum and urine
samples was used to examine a cohort of CD patients,
before and after a gluten-free diet (GFD), and healthy
controls. The results indicated that altered serum
levels of glucose and ketone bodies suggest
alterations of energy metabolism, while the urine
data point to alterations of gut microbiota (7). In
another study, Bernini and co-authors (8) addressed
potential CD patients, defined as subjects who do not
have, and have never had, a jejunal biopsy consistent
with clear CD, and yet have immunological
abnormalities similar to those found in celiac
patients. They demonstrated that metabolic
alterations may precede the development of small
intestinal villous atrophy and provide a further
rationale for early institution of GFD in patients with
potential CD, as recently suggested by prospective
clinical studies (8).

Random forests (RF), developed by Leo Breiman (9),
are an ensemble method based on classification and
regression trees. Compared with a single tree, RF
reduces the variance and improves the prediction
accuracy. In this study, we apply RF to discriminate
between control and CD subjects. To this end, we
assess the importance of the metabolic biomarkers that
can lead to the classification of these two groups.

Patients and Methods 

Sample population 

Blood samples from 30 adult CD patients (14
males and 16 females, mean age ± standard
deviation 34 ± 11 years) and 30 healthy subjects
(HS) were collected as described previously (10).

NMR spectroscopy 

¹H NMR spectra were acquired on a
Bruker DRX 500 MHz spectrometer using
5 mm NMR tubes. The details of this
technique were presented in our previous studies (11-13).
Metabolites present in the serum samples were
identified on the basis of several previous studies
(14-16).

Random forest (RF) 

Random forest (RF) classifiers were used to
classify the serum samples of healthy and CD
subjects and to identify the metabolites most important
for discriminating between them (9). RF is a non-linear
method built on classification and regression trees (CART)
that provides an importance ranking for each
metabolite. CART chooses splits that maximize
the reduction in heterogeneity, but a single tree tends
to overfit, which gives the classifier a high prediction
error on the test set; the bagging
mechanism in the RF algorithm mitigates this
overfitting problem (18). The RF algorithm is described
in Ref (9). Every tree built by the algorithm is
different owing to two factors. First, the best
split at each node is chosen
from a random subset of the predictors rather than
from all of them. Second, every tree is built on a
bootstrap sample of the observations. The observations
left out of a bootstrap sample, about one-third of the
total, are the out-of-bag (OOB) data; they
can be used to estimate the prediction accuracy.
Finally, the overall prediction is calculated by
averaging over all the trees.
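
The bootstrap and out-of-bag mechanism described above can be illustrated with a short sketch. It is not the authors' MATLAB implementation; the function names, the seed and the choice of Python are assumptions made for illustration only. It draws bootstrap samples of 42 observations and reports the average share that ends up out-of-bag, which should be close to one-third.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    # Observations never drawn for this tree form its out-of-bag set.
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


def mean_oob_fraction(n, n_trees, seed=0):
    """Average out-of-bag fraction over n_trees bootstrap samples."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    total = 0
    for _ in range(n_trees):
        _, oob = bootstrap_indices(n, rng)
        total += len(oob)
    return total / (n * n_trees)


if __name__ == "__main__":
    frac = mean_oob_fraction(n=42, n_trees=500)
    print(f"mean OOB fraction: {frac:.3f}")
```

In a full RF, each tree would be fitted on its in-bag indices and evaluated on its out-of-bag indices, so the OOB error is obtained without a separate validation set.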

The RF package is a readily accessible
implementation of the RF algorithm and can be
downloaded from the website. Data
preprocessing and modeling were carried out
in MATLAB (version 6.5.1, The MathWorks,
Cambridge, UK). RF was applied to the mean-centered
data set and validated by predicting the
classes of a test set not used in the training set (17).
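
As an illustration of the preprocessing described above, the following sketch splits 60 samples at random into training and test sets and mean-centers the variables. It is a hedged example, not the authors' MATLAB code: the data values are random placeholders, the helper names are hypothetical, and centering the test set with the training-set means is an assumption about how the centering was done.

```python
import random


def split_indices(n_samples, n_test, seed=0):
    """Randomly split sample indices into (train, test) lists."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return idx[n_test:], idx[:n_test]


def column_means(rows):
    """Mean of each variable (column) over the given samples (rows)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def center(rows, means):
    """Subtract the supplied column means from every row."""
    return [[x - m for x, m in zip(row, means)] for row in rows]


if __name__ == "__main__":
    rng = random.Random(1)
    # Placeholder data: 60 spectra x 3 integrated regions (not real values).
    X = [[rng.gauss(10.0, 2.0) for _ in range(3)] for _ in range(60)]
    train_idx, test_idx = split_indices(60, 18)
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    means = column_means(X_train)        # estimated on training data only
    X_train_c = center(X_train, means)
    X_test_c = center(X_test, means)     # test set reuses the training means
    print(len(X_train_c), len(X_test_c))
```

Estimating the means on the training samples alone keeps the test set independent of the model-building step.
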
Results

A training set of 42 samples and a test set of 18
samples were built by random splitting of the NMR
spectra, so that the test set contained about one-third
of the samples. In the classification model, the
descriptive variables were the integrals over the
different chemical shift regions of the NMR spectra,
while the class numbers of the samples were used as
the response. According to the RF model, the CD and
control groups could be classified using three
descriptors, corresponding to the metabolites lactate,
valine and lipid. Table 1 summarizes these
metabolites, whose levels differ significantly
(P-value < 0.001) between CD patients and the
control group. As Table 1 shows, serum lipid, valine
and lactate levels are lower in CD patients than in
the healthy group.

Table 1. Metabolites present in serum samples of celiac
patients and controls ᵃ

Metabolite   Assignment   ¹H chemical shift (ppm)   CD group
Lactate      βCH3         1.32                      ↓
Valine       γCH3         1.03                      ↓
Lipid        CH2CH2CO     1.56                      ↓

ᵃ The arrows (↓) indicate a decrease of metabolite levels in the CD group.
Figure 1. Plot of OOB error for RF classification of CD
and control group

Samples of the training set were classified
using RF in which 500 trees were grown. The
OOB data were used to estimate the prediction
accuracy of the classification. Figure 1 presents the
OOB error rate. To estimate the significance of a
variable, the RF algorithm measures how much the
prediction error increases when the OOB data for
that variable are permuted while all other variables
are left unchanged. A confusion matrix illustrates
the relation between the real classes and the
predicted classes. Table 2 presents the confusion
matrix of the RF model for the training and test
sets. As is clear from Table 2, the RF model has an
accuracy of 0.89 in detecting CD patients in the test
set. These results suggest that the RF model is
promising for the diagnosis of CD.
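
The permutation idea behind the variable significance can be sketched in a few lines. This is a simplified, model-agnostic illustration, not the RF package's implementation: a real forest permutes each variable in the OOB data of every tree, whereas here a fixed threshold rule (`toy_predict`, a hypothetical stand-in for a trained model) is evaluated on synthetic placeholder data.

```python
import random


def error_rate(predict, X, y):
    """Fraction of samples whose predicted class differs from the label."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in error when one variable at a time is shuffled."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # permute variable j, leave the others intact
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increase += error_rate(predict, X_perm, y) - baseline
        importances.append(increase / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in model: class 1 when variable 0 is negative."""
    return 1 if row[0] < 0.0 else 0


if __name__ == "__main__":
    rng = random.Random(1)
    X, y = [], []
    for i in range(40):
        label = i % 2
        informative = rng.gauss(-1.5 if label else 1.5, 1.0)
        noise = rng.gauss(0.0, 1.0)  # variable the model never uses
        X.append([informative, noise])
        y.append(label)
    print(permutation_importance(toy_predict, X, y))
```

A variable the model does not rely on yields an importance of zero, while shuffling an informative variable raises the error, which is the behaviour used to rank the metabolites.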

Table 2. Confusion matrix for training and test set.

                               Predicted
Set            Observed        CD class   Healthy class
Training set   CD class        20         1
               Healthy class   2          19
Test set       CD class        8          1
               Healthy class   1          8

Table 3 reports the specificity and
other classification parameters for
the training and test sets. Further evidence for the
capability of the RF model in CD diagnosis is the high
non-error rate in the external test set.

Table 3. The calculated error and non-error rates of the
classification index and the classification performances of
training and test sets

Set        Error rate   Non-error rate   Specificity   Sensitivity   Accuracy
Training   0.07         0.93             0.95          0.91          0.93
Test       0.11         0.89             0.89          0.89          0.89
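
For reference, the sketch below shows how such parameters can be computed from the counts of the confusion matrix in Table 2. It rests on two assumptions: CD is treated as the positive class, and the non-error rate is taken as one minus the error rate. Choosing the healthy class as positive instead would interchange sensitivity and specificity.

```python
def classification_metrics(tp, fn, fp, tn):
    """Summary metrics from the four cells of a two-class confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    return {
        "error_rate": 1 - accuracy,
        "non_error_rate": accuracy,       # assumed to be 1 - error rate
        "sensitivity": tp / (tp + fn),    # positives correctly detected
        "specificity": tn / (tn + fp),    # negatives correctly detected
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Counts from Table 2, with CD as the positive class (an assumption).
    training = classification_metrics(tp=20, fn=1, fp=2, tn=19)
    test = classification_metrics(tp=8, fn=1, fp=1, tn=8)
    for name, metrics in (("Training", training), ("Test", test)):
        print(name, {k: round(v, 2) for k, v in metrics.items()})
```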

Discussion 

Lipids are a group of naturally occurring
molecules that includes certain vitamins,
monoglycerides, and others. Lipids act as
structural components of cell membranes, and
their main biological functions also include
energy storage (18, 19). Krums et al. evaluated
lipid metabolism in patients with CD (20). They
attributed the findings in patients with CD to
disorders of lipid metabolism in the
small intestine. Lewis and co-authors investigated
the cholesterol profile of people with newly diagnosed
CD (21). They suggested that untreated CD is
associated with lower total cholesterol than in the
general population. Bertini et al. (7) proposed
that lipid oxidation is increased in CD, while
the intake of lipids is reduced because of
malabsorption. They attributed
the lower levels of lipids in sera to
enhanced lipid oxidation and malabsorption. They
also found valine and lactate levels in the serum of
healthy individuals to be less than those found in
CD patients. Our classification results are better
than those of Bertini and co-authors (7), who
found a classification accuracy for CD and healthy
control groups of 79.7-83.4% for serum and
69.3% for urine. We applied the RF method for
classification, which has fewer parameters to
optimize than support vector machines (SVM),
the method applied in Ref (7).

Valine is an essential amino acid that must be
obtained from the diet. In both structure and function,
valine is closely related to leucine and isoleucine. These
amino acids are important for supplying energy to
muscles; they increase endurance and aid in muscle
tissue recovery and repair. Hernanz et al. analyzed
amino acid concentrations in plasma from controls
and from treated and untreated patients with CD (22).
They found that both treated and untreated cohorts had
significantly decreased plasma concentrations of
citrulline, tyrosine, valine, isoleucine, and leucine
compared with control cohorts. In another study,
Bernini et al. stated that glycolysis is somehow
impaired in CD, explaining both a lowering of
lactate levels and an increase of glucose levels in
blood (8). The body produces lactate throughout
the day; it is an important fuel used by the
muscles during prolonged exercise. Bertini
and coworkers also described the metabonomic
signature of CD in terms of three components:
malabsorption, energy metabolism, and alterations of
gut microflora (7).

In conclusion, metabonomics and analysis of
the above-mentioned metabolites in
serum can be applied at an early stage of CD.
According to the results, RF proved to be quite
powerful in discriminating between CD and
healthy subjects.